Abstract

Consider a photon that has just emerged from a linear polarizing filter. If the photon is then
subjected to an orthogonal polarization measurement — e.g., horizontal vs vertical — the photon's 
preparation cannot be fully expressed in the outcome: a binary outcome cannot reveal the value of 
a continuous variable. However, a stream of identically prepared photons can do much better. To 
quantify this effect, one can compute the mutual information between the angle of polarization and 
the observed frequencies of occurrence of "horizontal" and "vertical." Remarkably, one finds that 
the quantum-mechanical rule for computing probabilities — Born's rule — maximizes this mutual 
information relative to other conceivable probability rules. However, the maximization is achieved 
only because linear polarization can be modeled with a real state space; the argument fails when 
one considers the full set of complex states. This result generalizes to higher dimensional Hilbert 
spaces: in every case, one finds that information is transferred optimally from preparation to 
measurement in the real-vector-space theory but not in the complex theory. Attempts to modify 
the statement of the problem so as to see a similar optimization in the standard complex theory are 
not successful (with one limited exception). So it seems that this optimization should be regarded 
as a special feature of real-vector-space quantum theory. 



I. INTRODUCTION 



In 1936 Birkhoff and von Neumann initiated an axiomatic approach to the foundations 
of quantum mechanics, taking as their starting point postulates inspired by classical logic 
but adapted to the peculiar features of quantum theory [1] . Though they showed that many 
characteristics of quantum theory could be captured in this way, they could also see that 
their logical approach would not lead uniquely to standard quantum theory. In particular 
they noted that along with standard complex-vector-space quantum theory, the postulates 
could just as well be satisfied by a theory based on a real or quaternionic Hilbert space [IB]. 

Over the years other authors have taken other approaches to axiomatization and have 
found reasonable assumptions that favor the complex theory over the real and quaternionic 
models. One successful strategy along these lines has been to insist on the existence of 
an uncertainty principle of a specific form [5–7]. Another approach put forward by several
authors relies on the fact that in standard quantum theory, it is possible to carry out a 
complete tomographic reconstruction of the state of a multipartite system entirely by means 
of local measurements on the individual components (taking into account correlations), 
with no need for global measurements on pairs of subsystems [SHTi]. The real-vector-space
theory does not have this property; so by adopting local tomography as an axiom, one 
rules out the real case. Surely, though, much of the appeal of these arguments comes from 
the fact that they succeed in leading us to what we believe to be the correct answer. If 
we had found ourselves living in a world that seemed to be well described by real-vector- 
space quantum theory, we would not have regarded it as a logical problem that tomography 
requires global measurements. It would simply be another peculiar feature of quantum 
theory, like entanglement [17] . (I admit, though, that the local tomographic property of the 
complex theory does feel as if it could be a clue to something deeper.) 

In this paper I would like to point out a particular property of real-vector-space quantum 
theory that I find especially intriguing: the transfer of information from a preparation to a 
measurement is optimal (in a sense to be explained below). Standard quantum theory does 
not have this property. So if we were trying to find a simple set of axioms that would generate 
real-vector-space quantum theory, we might well find ourselves adopting optimal information
transfer as one of our axioms. This property of the real theory has been known for years (it
appears in my 1980 doctoral dissertation), but I would like to give a somewhat simpler
and more intuitive presentation of it here.

One motivation for studying real-vector-space quantum theory is simply to shed light on
the standard theory by comparison. But I would also like to keep open the possibility that 
the real-vector-space theory might turn out to be of value in its own right for describing 
our world. Several authors have given us reasons for not discounting this possibility. In 
a series of papers published around 1960, Stueckelberg and his collaborators developed an 
alternative formulation of quantum field theory based on a real Hilbert space [H [T71 IT8] . 
In order to allow the existence of an uncertainty principle, Stueckelberg imposes a specific 
restriction on all the observables of the real-vector-space theory: every observable is required 
to commute with a certain operator that we can write as $I \otimes J$, where $J$ is the 2×2 matrix

$$ J = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} $$

and $I$ is the identity operator. (In the context of Stueckelberg's papers $I$ is the
identity on an infinite-dimensional real Hilbert space.) In effect, this restriction forces the
matrix representing any observable to be composed of 2×2 blocks of the form

$$ \begin{pmatrix} a & -b \\ b & a \end{pmatrix}. $$

Such 2x2 blocks add and multiply like complex numbers; so the theory becomes equivalent 
to the usual complex theory. One of the points Stueckelberg and his collaborators make in 
these papers is that in this formulation the time-reversal operator becomes linear, rather 
than antilinear as in the complex formulation. Around the same time, Dyson made the same 
point and argued that by bringing the time-reversal operator into our formalism, we are in 
effect basing our quantum theory on the field of real numbers [19].
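
The claim that such 2×2 blocks "add and multiply like complex numbers" is easy to check numerically. The following is only an illustrative sketch, not part of Stueckelberg's construction: it assumes the block convention in which the complex number a + ib is represented by the real matrix with rows (a, -b) and (b, a), and the helper names `block`, `matmul` and `matadd` are hypothetical.

```python
# Illustrative sketch: 2x2 real blocks [[a, -b], [b, a]] mimic complex arithmetic.
# The block convention and all helper names are assumptions made for this example.

def block(a, b):
    """Real 2x2 block standing in for the complex number a + ib."""
    return ((a, -b), (b, a))

def matmul(x, y):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def matadd(x, y):
    """Sum of two 2x2 matrices stored as nested tuples."""
    return tuple(tuple(x[i][j] + y[i][j] for j in range(2)) for i in range(2))

J = block(0.0, 1.0)  # plays the role of i; J times J is minus the identity

z, w = complex(1.5, -2.0), complex(0.25, 3.0)
Z, W = block(z.real, z.imag), block(w.real, w.imag)

product = matmul(Z, W)                                   # matrix product
expected_product = block((z * w).real, (z * w).imag)     # complex product, as a block
total = matadd(Z, W)                                     # matrix sum
expected_total = block((z + w).real, (z + w).imag)       # complex sum, as a block
```

Comparing `product` with `expected_product` (and `total` with `expected_total`) shows the correspondence; one can also check that every such block commutes with `J`.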

More recently Gibbons and his collaborators have argued that the complex structure in 
quantum theory is intimately related to the classical idea of time, and that both time itself 
and the associated complex structure could prove to be emergent features [20, 21]. In other
work, Myrheim has pointed out that if one wants a version of the canonical commutation
relation $[x, p] = i\hbar$ in a discrete system with finitely many values of position and momentum,
one cannot use standard complex quantum theory: the trace of any commutator is zero in
a finite-dimensional space, but the trace of $i\hbar$ is not zero. On the other hand, if we replace
$i\hbar$ with $J\hbar$ (the same $J$ as above), both sides of the equation have zero trace and there
is no contradiction [22]. In the present paper I do not particularly build on any of these
observations except insofar as they suggest that a real-vector-space version of quantum
theory might be used to describe our actual world, and that the theory is worth studying
for this reason as well as for whatever insights it might provide about standard quantum
mechanics.

I begin in Section II by saying what I mean by "real-vector-space quantum theory." Then
in Sections III and IV I present the property of optimal information transfer, first for a 
two-dimensional state space and then in d dimensions. As I have said, standard complex 
quantum theory does not have this property, and it is interesting to ask whether a revised 
statement of the problem might yield a positive answer even in the complex case. This is 
the subject of Section V. Section VI then summarizes our findings. 

II. REAL-VECTOR-SPACE QUANTUM THEORY

One can summarize the basic structure of standard quantum theory in the following four 
statements: 

1. A pure state is represented by a unit vector in a Hilbert space over the complex 
numbers. 

2. An ideal repeatable measurement is represented by a set of orthogonal projection
operators whose supports span the vector space. When a state $|s\rangle$ is subjected to the
measurement $\{P_1, \ldots, P_m\}$, the probability of the $i$th outcome is $\langle s|P_i|s\rangle$. When the
$i$th outcome occurs, the system is left in a state proportional to $P_i|s\rangle$.

3. A reversible transformation is represented by a unitary operator $U$. That is, for any
initial state the operation takes $|s\rangle$ to $U|s\rangle$.

4. A composite system has as its state space the tensor product of the state spaces of its 
components. 

Of course other states, measurements and transformations are possible. Mixed states are 
averages of projection operators on pure states, and there also exist non-orthogonal measure- 
ments and irreversible transformations. But all such generalizations can be obtained from 
the cases listed above by applying them to a larger system and possibly discarding part of 
the system. I have chosen the above formulation partly to keep the discussion simple, but 
also because I do tend to think of orthogonal measurements and pure states as being more 
fundamental than their generalizations.

The real-vector-space theory has essentially the same structure, except that all vectors
and matrices are limited to real components. The only changes in the above list are that
"complex" is to be replaced by "real" in item 1, and "unitary" is to be replaced by
"orthogonal" in item 3.

One might wonder what the analogue of the Schrödinger equation is in the real-vector-space
theory. The Schrödinger equation generates a unitary transformation through a Hermitian
operator, the Hamiltonian:

$$ i\hbar \frac{d}{dt}|s\rangle = H|s\rangle. \tag{1} $$

If $H$ is time independent, the unitary operator it generates over a time $t$ is $U(t) = e^{-iHt/\hbar}$,
since $|s(t)\rangle = U(t)|s(0)\rangle$ solves the differential equation. The analogous equation in the
real-vector-space case should have an antisymmetric real matrix in place of $-iH$, since such
a matrix generates orthogonal transformations. We can write the differential equation as

$$ \frac{d}{dt}|s\rangle = S|s\rangle, \tag{2} $$

where $S$ is an antisymmetric real operator. I like to call $S$ the "Stueckelbergian" in honor
of Ernst Stueckelberg (who of course did not use this term). If $S$ is time independent, then
the general solution of Eq. (2) is $|s(t)\rangle = e^{St}|s(0)\rangle$.
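
To make the role of the antisymmetric generator concrete, here is a small numerical sketch. It is only an illustration under stated assumptions: the matrix exponential is approximated by a truncated Taylor series (the hypothetical helper `expm_taylor`), and the particular 2×2 generator and the values of `omega` and `t` are arbitrary choices for the example.

```python
import math

def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def expm_taylor(s, t, terms=40):
    """Approximate exp(s*t) by a truncated Taylor series (fine for small matrices)."""
    n = len(s)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        step = [[s[i][j] * t / k for j in range(n)] for i in range(n)]
        term = matmul(term, step)          # term now holds (s*t)^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

omega, t = 0.7, 2.0                        # arbitrary illustrative values
S = [[0.0, -omega], [omega, 0.0]]          # antisymmetric real generator
O = expm_taylor(S, t)                      # should be a rotation by omega*t

# Orthogonality check: O^T O should be the identity.
O_transpose = [[O[j][i] for j in range(2)] for i in range(2)]
gram = matmul(O_transpose, O)
```

For this generator the result is the rotation matrix through the angle `omega * t`, which has determinant +1; this is the sense in which a time-independent generator of this kind produces rotations but not reflections.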

Another reasonable question is whether, for example, in a two-dimensional real space the
operator

$$ R = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \tag{3} $$

should be allowed to count as a possible transformation [23]. It is an orthogonal matrix,
so according to the above rules it does count. But there is no 2×2 Stueckelbergian that
can generate this operator. This is because the operator $R$ represents a reflection, not a
rotation, and there is no continuous set of orthogonal transformations on a two-dimensional
real space that takes us from the identity operator to a reflection operator.

Nevertheless, in a real-vector-space world it would still be possible to realize the operation
$R$ continuously by bringing in an ancillary two-dimensional system (that is, an ancillary
"rebit"). To effect the transformation

$$ a|0\rangle + b|1\rangle \;\to\; a|0\rangle - b|1\rangle, \tag{4} $$

we can perform a controlled rotation on our ancillary rebit, conditioned on the state of the
original rebit. By rotating the ancillary rebit by half a complete cycle, we can pick up the 
desired factor of —1. So it seems reasonable to allow orthogonal matrices with negative 
determinant to count as possible transformations. 
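
The ancilla construction can be sketched numerically. The example below rests on assumptions not spelled out in the text: the reflection is taken to flip the sign of the second basis state, the ancilla starts in its first basis state, and the four-dimensional basis is ordered as (system, ancilla) = 00, 01, 10, 11. The function names are hypothetical.

```python
import math

def controlled_rotation(phi):
    """Rotate the ancillary rebit by phi only when the system rebit is in state 1.

    Basis ordering (an assumption of this sketch): |00>, |01>, |10>, |11>,
    with the system rebit listed first and the ancilla second.
    """
    c, s = math.cos(phi), math.sin(phi)
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, c, -s],
        [0.0, 0.0, s, c],
    ]

def apply(matrix, vector):
    """Matrix-vector product for nested-list matrices."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]

a, b = 0.6, 0.8                      # arbitrary normalized amplitudes
initial = [a, 0.0, b, 0.0]           # (a|0> + b|1>) with the ancilla in |0>
final = apply(controlled_rotation(math.pi), initial)
```

After half a cycle (`phi = math.pi`) the ancilla is back in its starting state and the system amplitude `b` has changed sign, while every intermediate `controlled_rotation(phi)` is a rotation continuously connected to the identity.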

III. OPTIMAL TRANSFER OF INFORMATION: THE TWO-DIMENSIONAL 
CASE 

Consider the following simple scenario. A stream of photons emerges from a linearly
polarizing filter with its preferred axis oriented at an angle $\theta$ from the horizontal. Somewhere
further along the photons' path there is a polarizing beam splitter and a pair of single-photon
detectors, which together force each photon to yield either the horizontal outcome or the
vertical outcome. The probability of "horizontal" is $p_0(\theta) = \cos^2\theta$. (The subscript "0"
distinguishes this function from other hypothetical functions to be considered shortly.) This
function allows someone observing the measurement results to gain information about the
angle $\theta$.

This scenario illustrates a typical feature of a quantum measurement: a measurement on a 
single instance of a system (in this case a single photon) cannot convey complete information 
about the system's preparation. But a large statistical sample of measurements on identically 
prepared copies can eventually home in on the values of the preparation parameters (in this
case the single parameter $\theta$). One does not encounter this limitation in classical physics,
at least not for pure states: if a particle is placed at position x with momentum p, a 
measurement can directly reveal those values. This difference between classical and quantum 
physics reflects the fact that quantum theory is inherently probabilistic. 

In our specific example, one can ask how well the information about $\theta$ is conveyed through
the observed results. Specifically, one can quantify the mutual information between the
measurement results and the value of $\theta$. As we will see shortly, given the limitation imposed
by the probabilistic nature of the polarization measurement, the transfer of information is 
optimal in the limit of a large number of trials. That is, in this example anyway, quantum 
mechanics orchestrates the optimal conveyance of information from the preparation to the 
measurement outcome. 

Before justifying this statement, I want to note the sense in which we are effectively
framing the problem in the real-vector-space theory. By limiting the possibilities to linear
polarizations, we are ruling out all the polarization states one would normally represent 
with vectors having a nonzero imaginary part (circular and elliptical polarizations). We will 
return to this point toward the end of this section. 

Now let us make the statement precise. We do so by comparing our actual world, in which
the probability of "horizontal" is $p_0(\theta) = \cos^2\theta$, to a fictitious world in which the probability
is given by some arbitrary function $p(\theta)$. In such a world, let $N$ photons, each prepared
with linear polarization angle $\theta$, be subjected to a horizontal-vs-vertical polarization
measurement. Let $n$ be the number of these photons that yield the outcome "horizontal." The
mutual information between the measurement results and the value of $\theta$ is based on the
Shannon entropy $H$ and is defined to be

$$ I(n:\theta) = H(n) - H(n|\theta) = -\sum_{n=0}^{N} P(n)\ln P(n) + \frac{1}{2\pi}\int_0^{2\pi}\left[\sum_{n=0}^{N} P(n|\theta)\ln P(n|\theta)\right] d\theta. \tag{5} $$

Here we have assumed a uniform a priori distribution of $\theta$ over the interval $[0, 2\pi]$. (This
is a crucial assumption that we discuss further below.) $P(n|\theta)$ is the probability of getting
the horizontal outcome exactly $n$ times if the photons are prepared in the state $\theta$, and
$P(n)$ is the probability of getting the horizontal outcome exactly $n$ times in the absence of
any information about $\theta$ (that is, when $\theta$ is uniformly distributed). Both $P(n|\theta)$ and $P(n)$
depend on the function $p(\theta)$. Note that in Eq. (5) we have written the mutual information
as the average amount of information gained about the integer $n$ upon learning the value
of $\theta$. It does have this interpretation, but it can alternatively be interpreted as the average
amount of information one gains about the value of $\theta$ upon learning the value of $n$. (Mutual
information is symmetric in its two arguments.) This latter interpretation is more descriptive
of the scenario we are imagining, in which an observer at the polarizing beam splitter is trying
to learn about the value of $\theta$.
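
The mutual information of Eq. (5) can be evaluated numerically for a finite number of trials. The following is a minimal sketch, not the calculation used in the paper: it assumes the count of "horizontal" outcomes is binomially distributed, it replaces the integral over the angle by a midpoint rule, and the comparison law `linear` (probability rising linearly with the angle) is a hypothetical alternative chosen only for illustration.

```python
import math

def mutual_information(p, N, grid=2000):
    """Estimate I(n : theta) in nats for N trials, with theta uniform on [0, 2*pi).

    p is a probability law theta -> P("horizontal"); the theta integral is
    approximated by a midpoint rule with `grid` points (an assumption of this sketch).
    """
    binomial = [math.comb(N, n) for n in range(N + 1)]
    p_n = [0.0] * (N + 1)   # marginal distribution P(n)
    avg_cond = 0.0          # theta-average of sum_n P(n|theta) ln P(n|theta)
    for k in range(grid):
        theta = 2.0 * math.pi * (k + 0.5) / grid
        q = p(theta)
        for n in range(N + 1):
            prob = binomial[n] * q**n * (1.0 - q) ** (N - n)
            p_n[n] += prob / grid
            if prob > 0.0:
                avg_cond += prob * math.log(prob) / grid
    entropy_n = -sum(x * math.log(x) for x in p_n if x > 0.0)
    return entropy_n + avg_cond

def born(theta):
    """Quantum law: probability of "horizontal" is cos^2(theta)."""
    return math.cos(theta) ** 2

def linear(theta):
    """Hypothetical alternative law, used only for comparison."""
    return theta / (2.0 * math.pi)

for N in (1, 5, 20):
    print(N, mutual_information(born, N), mutual_information(linear, N))
```

For each value of `N` tried here, the cosine-squared law should yield the larger mutual information, in line with the asymptotic statement made in the text.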

It turns out that for large $N$, $I(n:\theta)$ grows as $(1/2)\ln N$. We therefore consider the
following limit, which has a finite upper bound:

$$ \tilde{I} = \lim_{N\to\infty}\left[ I(n:\theta) - \frac{1}{2}\ln N \right]. \tag{6} $$

We want to show that of all conceivable probability functions $p(\theta)$, the quantum mechanical
function $p_0(\theta) = \cos^2\theta$ gives $\tilde{I}$ its largest possible value.

At this point we could proceed to compute $\tilde{I}$ starting from Eq. (5), but the calculation will
be simpler, and I hope clearer, if we abstract the problem away from its quantum mechanical
setting. The important point to notice is that $I(n:\theta)$ depends on the probability function
$p(\theta)$ only through the measure it induces on the binary probability space. That is, before
we have any knowledge of $\theta$, we can use $p(\theta)$ and the assumed uniform distribution of $\theta$ to
figure out how likely it is that the probability of "horizontal" lies in any given interval, and
it is this weighting function that figures into $I(n:\theta)$.

The more abstract problem, then, can be stated as follows. Consider a two-outcome
probabilistic experiment, and let $(p_1, p_2)$ denote a point in the binary probability space
with $p_1$ corresponding to outcome #1. The experiment is run $N$ times, and outcome #1
is observed to occur $n$ times. This observation gives the experimenter information about
$(p_1, p_2)$ [48]. The mutual information $I$ between the value of $n$ and the value of $(p_1, p_2)$
depends on the experimenter's a priori measure on probability space. Our problem is to
find the a priori measure that maximizes the limit

$$ \tilde{I} = \lim_{N\to\infty}\left[ I - \frac{1}{2}\ln N \right]. \tag{7} $$

(The optimal measure will turn out to be unique.) We want to show that this optimal
measure is the one induced by the quantum probability function $p_0(\theta) = \cos^2\theta$ when $\theta$ is
uniformly distributed.

FIG. 1: The relation between $(p_1, p_2)$ and $\alpha$.

In order to tackle this problem we need to choose a parameterization of the binary
probability space. We could use $p_1$ or $p_2$ as our parameter, but it turns out to be more
convenient to use a different parameter $\alpha$ defined by $(p_1, p_2) = (\cos^2\alpha, \sin^2\alpha)$, where
$0 \le \alpha \le \pi/2$. The relation between $\alpha$ and $(p_1, p_2)$ is illustrated in Fig. 1. (One might object
that we seem to be smuggling some quantum mechanics into the calculation here, but we
are not. The results will be entirely independent of our choice of parameter. Our choice
merely simplifies the calculation.) Let $K(\alpha)\,d\alpha$ be the a priori measure on the set of values
of $\alpha$, normalized so that $\int_0^{\pi/2} K(\alpha)\,d\alpha = 1$. The mutual information between $\alpha$ and $n$ can
be written as

$$ I(\alpha:n) = h(\alpha) - h(\alpha|n) = -\int_0^{\pi/2} K(\alpha)\ln K(\alpha)\,d\alpha + \sum_{n=0}^{N} P(n)\int_0^{\pi/2} P(\alpha|n)\ln P(\alpha|n)\,d\alpha, \tag{8} $$

where $h(\alpha)$ and $h(\alpha|n)$ are differential entropies [19]. Here $P(\alpha|n)$ is the probability
distribution the experimenter assigns to $\alpha$ after seeing the value $n$, and $P(n)$ is the a priori
probability of the value $n$ as computed from the distribution $K(\alpha)$. (Note that if $K(\alpha)$ is
derived from the probability function $p(\theta)$ under the assumption that $\theta$ is uniformly
distributed, then $I(\alpha:n)$ is exactly equal to the quantity $I(n:\theta)$ given in Eq. (5).) The
point of the next paragraph is to show that under modest assumptions about the function
$K(\alpha)$, in the limit of very large $N$ the second term on the right-hand side of Eq. (8) becomes
independent of $K(\alpha)$. So we will only have to think about maximizing the first term.

To evaluate this second term, we need to write down expressions for $P(n)$ and $P(\alpha|n)$.
We have

$$ P(n) = \int_0^{\pi/2} P(n|\alpha)K(\alpha)\,d\alpha \tag{9} $$

and

$$ P(\alpha|n) = \frac{P(n|\alpha)K(\alpha)}{P(n)}, \tag{10} $$

where $P(n|\alpha)$ is given by the binomial distribution:

$$ P(n|\alpha) = \frac{N!}{n!\,(N-n)!}\,p_1^{\,n}\,p_2^{\,N-n}, \tag{11} $$

with $p_1 = \cos^2\alpha$ and $p_2 = \sin^2\alpha$. For any value of $p_1$ strictly between 0 and 1, it is possible
to choose $N$ large enough that the binomial distribution is well approximated by a Gaussian:

$$ P(n|\alpha) \approx \frac{1}{\sqrt{2\pi N p_1 p_2}}\exp\left[-\frac{N(n/N - p_1)^2}{2p_1p_2}\right]. \tag{12} $$

As $N$ gets very large this distribution, regarded as a function of $n/N$, becomes arbitrarily
highly peaked around $n/N = p_1$. Let $\alpha^{(n)}$ be defined so that $n/N = \cos^2\alpha^{(n)}$. That is, $\alpha^{(n)}$
is the value of $\alpha$ corresponding to the observed outcome $n$. Then in the above exponent, we
can approximate the quantity $(n/N - p_1)$ as

$$ (n/N - p_1) = \cos^2\alpha^{(n)} - \cos^2\alpha \approx \frac{d(\cos^2\alpha)}{d\alpha}\,\Delta\alpha = (-2\cos\alpha\sin\alpha)\,\Delta\alpha = -2\sqrt{p_1p_2}\,\Delta\alpha, \tag{13} $$

where $\Delta\alpha = \alpha^{(n)} - \alpha$. This gives us

$$ P(n|\alpha) \approx \frac{1}{\sqrt{2\pi N p_1 p_2}}\exp\left[-2N(\Delta\alpha)^2\right]. \tag{14} $$

Inserting this expression into Eq. (9), we again use the fact that the Gaussian is very highly
peaked so that we can (i) extend the integral from $-\infty$ to $\infty$ without changing its value
appreciably and (ii) evaluate everything outside the exponential at $\alpha = \alpha^{(n)}$. Then we get

$$ P(n) \approx \frac{K(\alpha^{(n)})}{2N\cos\alpha^{(n)}\sin\alpha^{(n)}}. \tag{15} $$

We now use Eqs. (10), (14) and (15) to approximate $P(\alpha|n)$:

$$ P(\alpha|n) \approx \sqrt{\frac{2N}{\pi}}\exp\left[-2N(\Delta\alpha)^2\right]. \tag{16} $$

Using this expression and again relying on the narrowness of the Gaussian, we get

$$ \int_0^{\pi/2} P(\alpha|n)\ln P(\alpha|n)\,d\alpha \approx \frac{1}{2}\ln\left(\frac{2N}{\pi e}\right). \tag{17} $$

Since this expression does not depend at all on $n$, it factors out of the sum in Eq. (8), so
that the only sum we have to do is $\sum_n P(n)$, which is unity by definition. Putting the pieces
together, we arrive at

$$ I(\alpha:n) \approx -\int_0^{\pi/2} K(\alpha)\ln K(\alpha)\,d\alpha + \frac{1}{2}\ln\left(\frac{2N}{\pi e}\right). \tag{18} $$

And then subtracting $(1/2)\ln N$ as in Eq. (7) gives us

$$ \tilde{I} = -\int_0^{\pi/2} K(\alpha)\ln K(\alpha)\,d\alpha + \frac{1}{2}\ln\left(\frac{2}{\pi e}\right). \tag{19} $$

The equality holds as long as our approximations become arbitrarily good as $N$ gets larger.
This will indeed be the case if the function $K(\alpha)$ is reasonably well behaved. A sufficient set
of conditions on $K(\alpha)$ is that it be positive and differentiable on the interval $[0, \pi/2]$. Then
when many trials are run, the range of likely values of $\alpha$ narrows to such a degree that the
final distribution $P(\alpha|n)$ does not depend appreciably on the a priori distribution $K(\alpha)$.

The problem has now been reduced to finding out what distribution or distributions $K(\alpha)$
maximize the quantity $-\int_0^{\pi/2} K(\alpha)\ln K(\alpha)\,d\alpha$. The answer to this question is well known:
the unique maximizing distribution is the uniform distribution $K(\alpha) = 2/\pi$. This result
follows from the fact that the function $\phi(x) = -x\ln x$ is a strictly concave function of $x$ for
all positive values of $x$. Jensen's inequality then tells us that

$$ (2/\pi)\int_0^{\pi/2} \phi[K(\alpha)]\,d\alpha \le \phi\left[(2/\pi)\int_0^{\pi/2} K(\alpha)\,d\alpha\right] = \phi(2/\pi), \tag{20} $$

with equality holding only for the constant function $K(\alpha) = 2/\pi$ [50].

Now we compare our result to quantum mechanics. Is this uniform distribution over $\alpha$ the
one induced by the quantum probability law $p_0(\theta) = \cos^2\theta$, when $\theta$ is uniformly distributed?
First consider the values of $\theta$ from 0 to $\pi/2$. In that range the law $p_0(\theta) = \cos^2\theta$ mirrors the
definition of $\alpha$ and we have $\alpha = \theta$. (I am taking $p_1$ to correspond to the horizontal outcome.)
So a uniform distribution of $\theta$ over this range would induce the uniform distribution of $\alpha$. In
the other three quadrants of the circle, that is, in the rest of the range of $\theta$, the parameter $\alpha$
is not equal to $\theta$ but we still have $|d\alpha/d\theta| = 1$ (except at a finite number of points where $\alpha$
"bounces" off one of the endpoints of its range). Thus when $\theta$ is uniformly distributed, so is
$\alpha$. This completes our demonstration that the quantum probability function $p_0(\theta) = \cos^2\theta$
is optimal.
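
The statement that a uniformly distributed polarization angle induces a uniform distribution of the parameter defined by cos^2(alpha) = p can be checked with a simple histogram. This is only a sketch: the angle is sampled on an evenly spaced grid rather than at random, and the function names and bin count are arbitrary choices for the example.

```python
import math

def alpha_from_theta(theta):
    """Return alpha in [0, pi/2] satisfying cos^2(alpha) = cos^2(theta)."""
    return math.acos(abs(math.cos(theta)))

def alpha_histogram(bins=10, grid=100000):
    """Fraction of an evenly spaced theta grid on [0, 2*pi) falling in each alpha bin."""
    counts = [0] * bins
    for k in range(grid):
        theta = 2.0 * math.pi * (k + 0.5) / grid
        alpha = alpha_from_theta(theta)
        index = min(int(alpha / (math.pi / 2.0) * bins), bins - 1)
        counts[index] += 1
    return [count / grid for count in counts]

fractions = alpha_histogram()
print(fractions)
```

Each of the ten bins should receive very nearly one tenth of the grid points, as expected for a uniform distribution over the interval from 0 to pi/2.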

Is the function $p_0(\theta) = \cos^2\theta$ unique in this respect? The answer is no. Any function
$p(\theta)$ that yields the same a priori measure on the binary probability space will be equally
good. For example, any function of the form $p(\theta) = \cos^2(m\theta/2)$ where $m$ is a positive
integer yields the same distribution $K(\alpha) = 2/\pi$. And there are many other, less physically
interesting examples. Still, a typical function $p(\theta)$ will not have this optimization property.

Looking back over the above argument, one can see that the crucial feature is the exponent
in Eq. (14): the coefficient of $(\Delta\alpha)^2$ depends only on $N$ and not on $\alpha$ itself. In other words,
the spread in the value of $\alpha^{(n)}$ depends only on the number of trials (when this number is
large), and not on the probabilities $(p_1, p_2)$. This is what is special about parameterizing
probability space with the parameter $\alpha$: it makes the statistical spread uniform. Once we
have this fact, it is guaranteed that the final differential entropy $h(\alpha|n)$ will not depend
on $K(\alpha)$. Therefore to maximize the mutual information, we want to maximize the initial
differential entropy $h(\alpha)$ and we are thereby led to the uniform distribution.

There is perhaps a more direct way of seeing what is special about the function
$p_0(\theta) = \cos^2\theta$. First note that the spread in $n$ itself is not uniform over probability space. If one
performs the binary experiment $N$ times, the standard deviation in $n/N$ is given by

$$ \Delta(n/N) = \sqrt{\frac{p_1 p_2}{N}}, \tag{21} $$

which is smaller near the ends of probability space than near the middle. One can see
this dependence in the exponent in Eq. (12). In our polarization experiment, an observer
recording the frequency of occurrence of "horizontal" will therefore be more certain of the
probability of "horizontal" when that probability is close to zero or one. (Again I am assuming
that the experimenter's a priori distribution over probability space is reasonably smooth and
that the number of trials is large.) On the other hand, upon translating the uncertainty in
probability to an uncertainty in $\theta$, the observer must use the function $p(\theta)$. For the special
case of $p_0(\theta) = \cos^2\theta$, the slope of this function exactly compensates for the varying size of
$\Delta(n/N)$, so that the size of the resulting "region of uncertainty" of $\theta$ is independent of the
value of $n/N$. Specifically,

$$ \left|\frac{dp_0}{d\theta}\right| = 2|\cos\theta\sin\theta| = 2\sqrt{p_0(\theta)[1 - p_0(\theta)]}, \tag{22} $$

which perfectly matches the dependence seen in Eq. (21). This compensation is illustrated
in Fig. 2. Thus the Born rule has the effect of equalizing the final uncertainty in $\theta$ over all
values of $\theta$. It is plausible that this even-handed strategy will be optimal, and indeed we
have just seen that it is.
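
The compensation described above can be made explicit with a few lines of arithmetic: dividing the spread in the observed frequency by the slope of the cosine-squared law gives an angular uncertainty of 1/(2*sqrt(N)) regardless of the angle. The sketch below simply evaluates that ratio at a few arbitrarily chosen angles; the function name and the sample values are assumptions made for illustration.

```python
import math

def theta_uncertainty(theta, N):
    """Translate the spread in n/N into a spread in theta for p0 = cos^2(theta)."""
    p = math.cos(theta) ** 2
    delta_freq = math.sqrt(p * (1.0 - p) / N)               # spread in n/N
    slope = abs(2.0 * math.cos(theta) * math.sin(theta))    # |dp0/dtheta|
    return delta_freq / slope

N = 10_000
sample_angles = (0.2, 0.7, 1.0, 1.4)      # arbitrary angles away from 0 and pi/2
spreads = [theta_uncertainty(theta, N) for theta in sample_angles]
print(spreads, 1.0 / (2.0 * math.sqrt(N)))
```

The same value appears at every angle, which is the "even-handed" property of the Born rule; a different law `p(theta)` would in general give an angle-dependent spread.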

FIG. 2: The uncertainty in $\theta$ for two different values of $n/N$. Notice that the slope of the
cosine-squared curve exactly compensates for the varying size of $\Delta(n/N)$, so that the "region of
uncertainty" in $\theta$ has the same size for all values of $n/N$. (Here $\theta$ is plotted only up to $\pi$ to make the
diagram simpler.)

We now consider the case in which all pure polarization states are possible. The full
set of pure states is the Bloch sphere (it includes the circular and elliptical polarizations),
and the natural a priori measure is the uniform measure on the sphere, since this is the
only probability measure invariant under all unitary transformations. We imagine a device
that prepares a beam of photons in one of these polarization states, and further along the
photons' path we imagine a person making the horizontal-vs-vertical measurement on each
photon. The polarization is now determined by two parameters; for definiteness let us take
them to be the polar angle $\beta$ and the azimuthal angle $\phi$, and let the north and south
poles of the sphere correspond to horizontal and vertical polarization. It is still possible
to define the mutual information between the photons' preparation and the measurement
outcomes; it could be written as $I(n:\beta,\phi)$. Again this mutual information is the same as
the quantity $I(\alpha:n)$ given in Eq. (8) and it is maximized only if the a priori distribution
of $\alpha$ is the uniform distribution $K(\alpha) = 2/\pi$. But now the quantum mechanical law does
not yield this distribution over the values of $\alpha$. With the uniform distribution over the
Bloch sphere, the parameter $\cos\beta$ is uniformly distributed over the interval $[-1, 1]$, and
the quantum mechanical probability of "horizontal," $p(\beta) = (1/2)(1 + \cos\beta)$, is therefore
uniformly distributed over the interval $[0, 1]$. To get the corresponding distribution of $\alpha$, we
use the relation $p_1 = \cos^2\alpha$ and the assumption that $p_1$ is uniformly distributed:

$$ K(\alpha) = \left|\frac{dp_1}{d\alpha}\right| = 2\cos\alpha\sin\alpha. \tag{23} $$

Thus, rather than giving us the uniform distribution of $\alpha$, the full Bloch sphere gives us a
distribution that has a maximum in the middle of $\alpha$'s range.

We can see directly that this distribution does not allow as much information transfer as
the optimal distribution. The relevant quantity is the integral in Eq. (19):

$$ -\int_0^{\pi/2} K(\alpha)\ln K(\alpha)\,d\alpha. \tag{24} $$

For the uniform distribution over $\alpha$, this quantity has the value $\ln(\pi/2) = 0.452$, whereas
for the distribution $K(\alpha) = 2\cos\alpha\sin\alpha$, we get $1 - \ln 2 = \ln(e/2) = 0.307$.
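
The two entropy values quoted above can be reproduced by direct numerical integration. The sketch below uses a plain midpoint rule; the function names and the grid size are arbitrary choices for this illustration, and the second distribution is the one that arises from the uniform measure on the Bloch sphere.

```python
import math

def differential_entropy(K, grid=20000):
    """Midpoint-rule estimate of -integral of K ln K over [0, pi/2], in nats."""
    width = (math.pi / 2.0) / grid
    total = 0.0
    for k in range(grid):
        value = K((k + 0.5) * width)
        if value > 0.0:
            total -= value * math.log(value) * width
    return total

def uniform(alpha):
    """Uniform distribution over alpha (the real-vector-space case)."""
    return 2.0 / math.pi

def bloch(alpha):
    """Distribution over alpha induced by the uniform measure on the Bloch sphere."""
    return 2.0 * math.cos(alpha) * math.sin(alpha)

h_uniform = differential_entropy(uniform)
h_bloch = differential_entropy(bloch)
print(h_uniform, math.log(math.pi / 2.0))
print(h_bloch, 1.0 - math.log(2.0))
```

The numerical estimates should agree with the closed forms ln(pi/2) and 1 - ln 2, and their difference measures how much information transfer is lost, in this limit, when the full set of complex states is allowed.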

Just as the uniform measure over the surface of the Bloch sphere is natural because it is
invariant under all unitaries, in the real-vector-space theory where the set of pure states in
two dimensions traces out a circle rather than a sphere, the uniform distribution over the
circle is natural because it is invariant under all orthogonal transformations (rotations and
reflections). That is, in the real-vector-space theory, we can use this invariance to justify
our original assumption that the angle $\theta$ is uniformly distributed over the interval $[0, 2\pi]$.



IV. OPTIMAL TRANSFER OF INFORMATION: THE d-DIMENSIONAL CASE 

The above argument extends to a $d$-dimensional real vector space. Let a "redit" be a
hypothetical quantum object whose pure states are vectors in a $d$-dimensional vector space
over the real numbers. We now imagine an experiment in which a beam of $N$ redits is
prepared in a specific pure state $|s\rangle$. At some point further along the beam, an observer
makes a fixed complete orthogonal measurement on each redit. The observer records the
integers $n_1, \ldots, n_d$, where $n_i$ is the number of times the $i$th measurement outcome occurs. We
ask how much information the observer learns on average about the preparation assuming
(crucially) that the vector $|s\rangle$ is initially distributed uniformly over the unit sphere in the
$d$-dimensional space. Again this average information gain is given by the mutual information,
which we will write down shortly. The mutual information depends on the law that specifies
the probability of the $i$th outcome given the preparation $|s\rangle$. In real-vector-space quantum
theory, this law can be expressed as

$$ p_i(|s\rangle) = s_i^2, \qquad i = 1, \ldots, d, \tag{25} $$

where $s_1, \ldots, s_d$ are the components of $|s\rangle$ in the basis defined by the measurement.

As before, what really matters in computing the mutual information is the a priori
measure on probability space. The uniform measure over the unit sphere in $d$ dimensions,
together with Eq. (25), defines some specific a priori measure on probability space. We also
want to consider other a priori measures, in order to show that the one induced by Eq. (25)
is optimal. The probability space is now a $(d-1)$-dimensional set, since the probabilities
must add to one. We could parameterize this set by the probabilities $p_1, \ldots, p_{d-1}$ of the
first $d-1$ outcomes, but we instead choose to label the points of probability space by a
unit vector $\gamma = (\sqrt{p_1}, \ldots, \sqrt{p_d})$. (Note that each $\sqrt{p_i}$ is non-negative; so $\gamma$ is confined to
the positive part of the unit sphere.) We could go further and choose $d-1$ specific angular
coordinates to locate this vector on the sphere (like the $\alpha$ of the preceding section), but we
will not need to do so. Let $K(\gamma)\,d\gamma$ be a generic a priori probability measure on the set of
vectors $\gamma$, where $d\gamma$ is an infinitesimal $(d-1)$-dimensional surface element on the positive
section of the unit sphere. Our goal is to find the distribution $K(\gamma)$ that maximizes the
mutual information.

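That mutual information can be written as follows:

$$ I(\gamma:\vec{n}) = h(\gamma) - h(\gamma|\vec{n}) = -\int K(\gamma)\ln K(\gamma)\,d\gamma + \sum_{\vec{n}} P(\vec{n})\int P(\gamma|\vec{n})\ln P(\gamma|\vec{n})\,d\gamma, \tag{26} $$

where $\vec{n} = (n_1, \ldots, n_d)$ specifies the number of times each outcome occurs. The sum is over
all vectors $\vec{n}$ for which each $n_i$ is a non-negative integer and $n_1 + \cdots + n_d = N$. The mutual
information is now based on the multinomial distribution:

$$ P(\vec{n}|\gamma) = \frac{N!}{n_1!\cdots n_d!}\,p_1^{\,n_1}\cdots p_d^{\,n_d}. \tag{27} $$

(Here $p_j = \gamma_j^2$.) The functions appearing in Eq. (26) can be obtained from $P(\vec{n}|\gamma)$ as follows:

$$ P(\vec{n}) = \int P(\vec{n}|\gamma)K(\gamma)\,d\gamma \tag{28} $$

and

$$ P(\gamma|\vec{n}) = \frac{P(\vec{n}|\gamma)K(\gamma)}{P(\vec{n})}. \tag{29} $$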


As $N$ gets large, it will turn out that $I(\gamma : n)$ grows as $[(d-1)/2] \ln N$. So we will compute
the limiting value

$$\lim_{N \to \infty} \left[ I(\gamma : n) - \left( \frac{d-1}{2} \right) \ln N \right]. \tag{30}$$



At this point the calculation is very similar to the one in the preceding section. As we did
for the analogous equation in that case, we now show that the second term on the right-hand
side of Eq. (26) becomes independent of $K(\gamma)$ as $N$ approaches infinity.



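For any fixed positive values of $p_1, \ldots, p_d$ and for large enough $N$, Eq. (27) can be
approximated arbitrarily well by a Gaussian function:

$$P(n|\gamma) \approx \left[ (2\pi N)^{d-1} p_1 p_2 \cdots p_d \right]^{-1/2} \exp\left[ -\frac{N}{2} \sum_{i=1}^{d} \frac{(n_i/N - p_i)^2}{p_i} \right]. \tag{31}$$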
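The quality of this Gaussian approximation can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes the standard chi-square form of the Gaussian limit of the multinomial distribution, together with its square-root-coordinate version with exponent $-2N|\Delta\gamma|^2$. The function names and the sample values of $p$ and $n$ are illustrative choices.

```python
import math
import numpy as np

def log_multinomial(n, p):
    """Exact log-probability of the counts n under a multinomial with probabilities p."""
    N = sum(n)
    log_coeff = math.lgamma(N + 1) - sum(math.lgamma(k + 1) for k in n)
    return log_coeff + sum(k * math.log(q) for k, q in zip(n, p))

def log_gaussian_flat(n, p):
    """Gaussian approximation in terms of the frequencies n_i/N (assumed chi-square form)."""
    N, d = sum(n), len(n)
    chi2 = sum((k / N - q) ** 2 / q for k, q in zip(n, p))
    log_prefactor = -0.5 * ((d - 1) * math.log(2 * math.pi * N) + sum(math.log(q) for q in p))
    return log_prefactor - 0.5 * N * chi2

def log_gaussian_sqrt(n, p):
    """Same approximation in square-root coordinates, with exponent -2N |delta gamma|^2."""
    N, d = sum(n), len(n)
    gamma = np.sqrt(np.array(p, dtype=float))
    gamma_obs = np.sqrt(np.array(n, dtype=float) / N)
    dist2 = float(np.sum((gamma_obs - gamma) ** 2))
    log_prefactor = -0.5 * (d - 1) * math.log(2 * math.pi * N) - float(np.sum(np.log(gamma)))
    return log_prefactor - 2 * N * dist2

# Illustrative values: d = 3, N = 1000, counts slightly off the expected ones.
p = (0.5, 0.3, 0.2)
n = (510, 295, 195)
exact = log_multinomial(n, p)
ratio_flat = math.exp(log_gaussian_flat(n, p) - exact)
ratio_sqrt = math.exp(log_gaussian_sqrt(n, p) - exact)
print(ratio_flat, ratio_sqrt)  # both ratios should be close to 1 for large N
```

Working with logarithms avoids overflow in the factorials; the two ratios approach 1 as $N$ grows with $p$ held fixed.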



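It will be helpful to define the vectors

$$\gamma^{(n)} = \left( \sqrt{n_1/N}, \ldots, \sqrt{n_d/N} \right) \quad \text{and} \quad \Delta\gamma = \gamma^{(n)} - \gamma. \tag{32}$$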
That is, $\gamma^{(n)}$ is the vector of square roots of the observed frequencies of occurrence, whereas
$\gamma$ is the vector of square roots of the probabilities. The difference $\Delta\gamma$ between these two
vectors is likely to be small when $N$ is large; so we will keep just the lowest-order term in
this quantity. We can then rewrite the sum in the exponent of Eq. (31) as follows:

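$$\sum_{i=1}^{d} \frac{(\Delta p_i)^2}{p_i} = \sum_{i=1}^{d} \frac{\left[ (\gamma_i^{(n)})^2 - \gamma_i^2 \right]^2}{p_i} \approx 4 |\Delta\gamma|^2, \tag{33}$$

where $\Delta p_i$ is defined to be $(n_i/N) - p_i$ and the last step comes from

$$(\gamma_i^{(n)})^2 - \gamma_i^2 \approx 2 \gamma_i\, \Delta\gamma_i, \qquad p_i = \gamma_i^2. \tag{34}$$

We can therefore approximate Eq. (31) as

$$P(n|\gamma) \approx (2\pi N)^{-(d-1)/2}\, \frac{1}{\gamma_1 \gamma_2 \cdots \gamma_d}\, \exp\left( -2N |\Delta\gamma|^2 \right). \tag{35}$$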

That is, $P(n|\gamma)$, regarded as a function of $\gamma$, falls off as a Gaussian around the point $\gamma^{(n)}$,
with a spread that is isotropic and independent of the value of $\gamma^{(n)}$. (This function is not,
however, a normalized distribution of $\gamma$; rather, it is normalized with respect to a sum over
$n$.)



In approximating the integral in Eq. (28), we rely on the narrowness of the Gaussian:
the integral is over a section of a sphere, but we can treat it as being over an infinite flat
space having $d-1$ dimensions, namely the "plane" tangent to the sphere at the point $\gamma^{(n)}$. We also
evaluate everything outside the exponential at the point $\gamma^{(n)}$. These approximations give us

$$P(n) \approx (2N)^{-(d-1)}\, \frac{K(\gamma^{(n)})}{\gamma_1^{(n)} \gamma_2^{(n)} \cdots \gamma_d^{(n)}}. \tag{36}$$



Inserting this expression and Eq. (35) into Eq. (29), we get

$$P(\gamma|n) \approx \left( \frac{2N}{\pi} \right)^{(d-1)/2} \exp\left( -2N |\Delta\gamma|^2 \right). \tag{37}$$



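We can now do the second integral in Eq. (26), again treating the integral as if it were over
an infinite $(d-1)$-dimensional flat space. The result is

$$I(\gamma : n) \approx -\int K(\gamma) \ln K(\gamma)\, d\gamma + \frac{d-1}{2} \ln\left( \frac{2N}{\pi e} \right). \tag{38}$$

Finally we subtract $[(d-1)/2] \ln N$ as in Eq. (30) to get

$$\lim_{N \to \infty} \left[ I(\gamma : n) - \left( \frac{d-1}{2} \right) \ln N \right] = -\int K(\gamma) \ln K(\gamma)\, d\gamma + \frac{d-1}{2} \ln\left( \frac{2}{\pi e} \right). \tag{39}$$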






Note that Eq. (19) is a special case of this equation, with $d = 2$. As in that case, the
expression is maximized by choosing $K(\gamma)$ to correspond to the uniform distribution:

$$K_{\text{opt}}(\gamma) = \frac{2^{d-1}\, \Gamma(d/2)}{\pi^{d/2}}. \tag{40}$$

The constant on the right-hand side of Eq. (40) is the reciprocal of the "surface area" of the
positive section of the unit sphere.
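As a small illustration, the constant described here can be evaluated directly. This sketch assumes the standard formula $2\pi^{d/2}/\Gamma(d/2)$ for the surface area of the unit sphere in $d$ dimensions, and that the positive section is one of $2^d$ congruent pieces; the function names are hypothetical.

```python
import math

def positive_section_area(d):
    """Area of the part of the unit sphere in R^d on which every coordinate is non-negative."""
    full_sphere = 2 * math.pi ** (d / 2) / math.gamma(d / 2)
    return full_sphere / 2 ** d  # the sphere splits into 2^d congruent sections

def k_opt(d):
    """Density of the uniform distribution on the positive section (reciprocal of its area)."""
    return 1.0 / positive_section_area(d)

for d in (2, 3, 4):
    print(d, positive_section_area(d), k_opt(d))
```

For $d = 2$ the positive section is a quarter circle of length $\pi/2$, and for $d = 3$ it is an octant of area $\pi/2$, so in both cases the density is $2/\pi$.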

The question now is whether the probability rule in real-vector-space quantum theory,
$p_i = s_i^2$, induces the measure on probability space given by $K_{\text{opt}}(\gamma)$. The answer is yes, as
is easily seen. The state vector $|s\rangle$ ranges over the full unit sphere in $\mathbb{R}^d$, but consider for
now just the section of the sphere in which each $s_i$ is positive. In that section the vector $\gamma$
is equal to the vector $|s\rangle$, since $p_i = \gamma_i^2 = s_i^2$. So the uniform distribution of $|s\rangle$ over this
section of the sphere induces the uniform distribution of $\gamma$. The whole unit sphere in $\mathbb{R}^d$
consists, in effect, of $2^d$ copies of the positive section. So indeed, the uniform distribution
of $|s\rangle$ over the whole sphere does correspond to the uniform distribution of $\gamma$ expressed in
Eq. (40). That is, the transfer of information from preparation to measurement is optimized
in this $d$-dimensional case, just as it was in the two-dimensional case.

The complications involved in the information-theoretic calculation may obscure what
is really a simple underlying fact. Imagine probability space as a $(d-1)$-dimensional flat
"surface" in a $d$-dimensional space with orthogonal axes labeled $p_1, \ldots, p_d$. The surface
consists of all points $(p_1, \ldots, p_d)$ such that each $p_i$ is non-negative and the sum
$p_1 + \cdots + p_d$ is equal to 1. Around each point on this surface one can imagine a small "region of
uncertainty," representing the spread in the actual frequencies of occurrence of the $d$ possible
outcomes in $N$ trials. These regions of uncertainty can be derived from the multinomial
distribution, Eq. (27), and for large $N$ their sizes and shapes can be read off the exponent
in Eq. (31). (We could, for example, define "region of uncertainty" to be the range of values
of $(n_1/N, \ldots, n_d/N)$ for which the exponent has magnitude less than 1.) One can see that
these regions of uncertainty will have different sizes and shapes, depending on the location
in probability space. For example, the largest is at the exact center, as shown in Fig. 3.
But if we change the axes of probability space from $p_i$ to $\gamma_i = \sqrt{p_i}$, probability space then
looks like a section of a sphere. Again one can speak of a region of uncertainty around each
point on this spherical surface, but now it happens that all the regions of uncertainty have
the same size and shape (in fact they are all spherical), as we can see in the exponent of
Eq. (35) and as is illustrated in Fig. 4. (The issue gets tricky near the edges. The closer
one gets to the edge, the higher the value of $N$ must be in order to see this uniformity. But
no matter how close one is to the edge, as long as one is not at the edge, there is always
such a value of $N$.) In this sense, there is something special about representing probability
space as a section of a sphere: it captures geometrically the statistical fluctuations in a large
sample. What is special about real-vector-space quantum theory is that its set of pure states
mirrors this representation of probability space.

FIG. 3: Regions of uncertainty at different locations in the flat probability space. As one approaches
an edge, the uncertainty shrinks along the direction perpendicular to the edge.
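The contrast between the flat and the spherical pictures can be seen in a short simulation. The sketch below is only an illustration under stated assumptions: it draws multinomial samples at two arbitrarily chosen points of a three-outcome probability space and compares the mean squared spread of the frequencies with the mean squared spread of their square roots. For a Gaussian with exponent $-2N|\Delta\gamma|^2$ in $d-1$ dimensions, the expected value of $|\Delta\gamma|^2$ is $(d-1)/(4N)$; the function name, sample sizes, and seed are illustrative choices.

```python
import numpy as np

def spreads(p, n_trials, n_samples=20000, seed=0):
    """Return (flat, curved): mean squared deviation of the observed frequencies from p,
    and of the square roots of the frequencies from sqrt(p)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    freq = rng.multinomial(n_trials, p, size=n_samples) / n_trials
    flat = np.mean(np.sum((freq - p) ** 2, axis=1))
    curved = np.mean(np.sum((np.sqrt(freq) - np.sqrt(p)) ** 2, axis=1))
    return float(flat), float(curved)

N = 2000
flat_center, curved_center = spreads([1 / 3, 1 / 3, 1 / 3], N)
flat_edge, curved_edge = spreads([0.7, 0.2, 0.1], N, seed=1)
predicted = (3 - 1) / (4 * N)  # (d - 1) / (4N) for d = 3

print("flat:  ", flat_center, flat_edge)      # depends on the location
print("curved:", curved_center, curved_edge)  # nearly the same at both locations
print("predicted curved spread:", predicted)
```

In the flat coordinates the spread is visibly larger at the center than near an edge, while in the square-root coordinates the two values agree with each other and with $(d-1)/(4N)$ up to sampling noise.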

As one would expect, the mathematical fact illustrated in Fig. 4 has been well noted in the
statistics literature. Bhattacharyya in the 1940s proposed a distance measure between two
probability distributions based on the angle between their $\gamma$ vectors [27]. The square-root
construction has been particularly explicit in the genetics literature. One can see diagrams of
the positive section of the unit sphere in papers by Cavalli-Sforza and collaborators from the
1960s, and these authors give credit for the idea to R. A. Fisher [28, 29] (as do Mosteller and
Tukey [30]). In the present paper, I have used the square-root construction only to identify
a special measure on probability space: the uniform measure on the spherical surface traced
out by $\gamma$. But one can also use it to define a special metric on the space, and this is what
Bhattacharyya, Cavalli-Sforza and others have done. (One can find in Ref. [31] a review of
various "genetic distances," some of which are based on the square-root construction.) Such
a metric has also been used in work on quantum foundations [16, 32-34]. However, I want
to emphasize that this special feature of the representation of probability space in terms of
square roots of probability arises without any reference to quantum theory. It is simply a
matter of statistics.

FIG. 4: Regions of uncertainty at different locations in the spherical probability space, with axes
corresponding to the square roots of probability. Now all regions of uncertainty are isotropic and
of the same size.

What about ordinary complex-vector-space quantum theory? In that theory each pure
state is represented by a vector $|s\rangle$ in $\mathbb{C}^d$. The natural a priori distribution over pure
states is the uniform distribution over the unit sphere in $\mathbb{C}^d$, that is, the unique distribution
invariant under all unitary transformations. (We could just as well speak of a distribution
over projection operators $|s\rangle\langle s|$ so as not to have to worry about the irrelevant overall phase
factor in the vector $|s\rangle$, but for our purposes either picture leads to the same result.) For a
complete orthogonal measurement, the probabilities of the outcomes are given by $p_i = |s_i|^2$,
where the $s_i$'s are the components of $|s\rangle$ in the basis defined by the measurement. We
can ask what measure this probability rule, together with the a priori distribution of state
vectors $|s\rangle$, induces on probability space. That question was answered by Sykora in 1974: it
induces the uniform distribution, not on the spherical surface defined by $\gamma$, but on the flat
surface defined by $(p_1, \ldots, p_d)$ [35]. This is a remarkably simple and intriguing result, but
this distribution is not the one that optimizes the transfer of information from preparation
to measurement.
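The difference between the two induced measures can be illustrated by sampling. The sketch below makes two assumptions that are not spelled out in the text: uniformly distributed unit vectors are generated by normalizing Gaussian vectors, and the two candidate measures are identified through a standard Dirichlet moment. The uniform distribution on the flat simplex gives $E[p_1^2] = 2/(d(d+1))$, whereas the uniform distribution on the square-root sphere gives $E[p_1^2] = 3/(d(d+2))$. The function name and sample sizes are illustrative.

```python
import numpy as np

def induced_probabilities(d, n_samples, complex_states, seed=0):
    """Sample unit vectors uniformly (real or complex) and return the outcome probabilities."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_samples, d))
    if complex_states:
        v = v + 1j * rng.standard_normal((n_samples, d))
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return np.abs(v) ** 2  # p_i = |s_i|^2, which reduces to s_i^2 in the real case

d = 3
p_real = induced_probabilities(d, 200_000, complex_states=False)
p_complex = induced_probabilities(d, 200_000, complex_states=True, seed=1)

m2_real = float(np.mean(p_real[:, 0] ** 2))
m2_complex = float(np.mean(p_complex[:, 0] ** 2))
print("real:   ", m2_real, "expected", 3 / (d * (d + 2)))     # uniform on the sphere of gamma
print("complex:", m2_complex, "expected", 2 / (d * (d + 1)))  # uniform on the flat simplex
```

A single moment does not characterize a distribution, so this is a consistency check rather than a proof; it does show that the real and complex rules induce different measures on probability space.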

V. OPTIMAL INFORMATION TRANSFER IN STANDARD QUANTUM MECHANICS?

The real-vector-space theory thus has a certain elegance to it, in that there is an optimal
correspondence between the set of pure states and the set of probability distributions over 
the outcomes of a complete orthogonal measurement. The complex theory does not have 
this property, but one might wonder whether this is because we are not asking the question 
in the right way. That is, by somehow reframing the problem, might it be possible to see 
that the usual complex theory does exhibit the property of optimal information transfer in 
some altered sense? 

For example, perhaps we are making a mistake to consider a complete orthogonal measurement.
Such a measurement will never reveal the relative phases between the components
of the state vector when it is written in the measurement basis. Instead we could consider
a special case of a non-orthogonal measurement, namely, a symmetric informationally-complete
measurement (a SIC) [36-38]. In $d$ complex dimensions, such a measurement has $d^2$
possible outcomes. Each outcome corresponds to a pure state $|m_j\rangle\langle m_j|$, and the inner product
between any two of these pure states has the same magnitude: $|\langle m_j|m_k\rangle|^2 = 1/(d+1)$ for $j \neq k$.
Numerical evidence strongly indicates that such symmetric measurements exist for all values
of $d$ up to 67 [39], and it would be reasonable to guess that they exist for all $d$. Such
symmetric measurements figure prominently in the quantum Bayesian approach to understanding
quantum mechanics [40, 41]. Is there a kind of optimal transfer of information
from preparation to measurement that occurs when the measurement is a SIC?

With $d^2$ possible outcomes, one can estimate $d^2 - 1$ independent parameters by repeating
the measurement on many identically prepared copies. This is exactly the number of
parameters needed to specify a $d \times d$ density matrix, and indeed, any density matrix can
be reconstructed with arbitrary precision from a fixed SIC applied to many copies. (This is
the meaning of "informationally complete.") To state the question of optimal information
transfer, we would need to specify an a priori measure on the set of all $d \times d$ density matrices.
The measure should be unitarily invariant, but there are many unitarily invariant measures
on this set. Is there at least one such measure for which the mutual information between
the preparation (of a general mixed state) and the measurement outcomes is optimal?

One can see quickly that there is no such measure. The optimal a priori measure on
probability space has already been determined in the preceding section. For $d^2$ possible
outcomes, the optimal measure is the uniform measure over the $(d^2 - 1)$-dimensional spherical
surface of probability space, when the axes correspond to the square roots of the probabilities.
This measure clearly assigns nonzero weight to every nonzero volume of probability space.
But if one performs a SIC on any state, the largest possible value of any probability is
$1/d$. Thus the SIC does not make use of the whole probability space; so it is not providing
information optimally in our sense, no matter what weighting function we place on the set
of density matrices.
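The bound on the outcome probabilities can be made concrete for a qubit. The sketch below is a hedged illustration, not taken from the text: it assumes the well-known tetrahedral SIC for $d = 2$, whose four measurement operators are $(I + \hat{r}_k \cdot \vec{\sigma})/4$ with the Bloch vectors $\hat{r}_k$ at the corners of a regular tetrahedron. The names `BLOCH`, `sic_elements`, and `sic_probabilities` are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Bloch vectors of the four SIC states: corners of a regular tetrahedron (assumed choice).
BLOCH = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / np.sqrt(3)

def sic_elements():
    """Measurement operators E_k = (1/2)|m_k><m_k| = (I + r_k . sigma) / 4."""
    return [(I2 + r[0] * X + r[1] * Y + r[2] * Z) / 4 for r in BLOCH]

ELEMENTS = sic_elements()

def sic_probabilities(psi):
    """Outcome probabilities <psi|E_k|psi> for a (not necessarily normalized) qubit vector."""
    psi = psi / np.linalg.norm(psi)
    return np.array([np.real(np.vdot(psi, E @ psi)) for E in ELEMENTS])

rng = np.random.default_rng(1)
states = rng.standard_normal((5000, 2)) + 1j * rng.standard_normal((5000, 2))
probs = np.array([sic_probabilities(s) for s in states])
print("largest probability seen:", probs.max(), "bound 1/d =", 0.5)
```

No sampled state produces a probability above $1/2$, so the region of the four-outcome probability space in which one outcome is more likely than not is never reached.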

Let us try another version of the problem. Suppose we are given a specific entangled
state of a pair of qubits, namely, the state

$$|\Phi^+\rangle = \frac{1}{\sqrt{2}} \left( |00\rangle + |11\rangle \right). \tag{41}$$

We imagine the first qubit is held by Alice and the second by Bob. Now Alice applies a
unit-determinant unitary transformation $U$ to her qubit, that is, an element of SU(2). She then
sends her qubit to Bob, who performs a Bell measurement on the two qubits. That is, he
distinguishes the four orthogonal states

$$\begin{array}{ll}
|\Phi^+\rangle = \frac{1}{\sqrt{2}} \left( |00\rangle + |11\rangle \right) & |\Phi^-\rangle = \frac{1}{\sqrt{2}} \left( |00\rangle - |11\rangle \right) \\
|\Psi^+\rangle = \frac{1}{\sqrt{2}} \left( |01\rangle + |10\rangle \right) & |\Psi^-\rangle = \frac{1}{\sqrt{2}} \left( |01\rangle - |10\rangle \right)
\end{array} \tag{42}$$

We imagine this whole procedure is repeated over and over (always with the same initial
state, the same $U$, and the same Bell measurement) so that Bob can try to gain information
about $U$ from the outcomes of his measurements. We assume he already knows the initial
state $|\Phi^+\rangle$. (This scenario is like superdense coding except that we are not restricting $U$
to a discrete set. Really what Bob is doing here is a restricted kind of process tomography
[42, 43]: trying to infer the process $U$ from the outcomes of measurements.) We can ask whether
the transfer of information is optimal between Alice's choice of unitary transformation and
the outcomes of Bob's measurements.

A general element of SU(2) can be represented as

$$U = \exp\left[ i (\theta/2)\, \hat{n} \cdot \vec{\sigma} \right], \tag{43}$$

where $\hat{n}$ is the unit vector defining the axis of the Bloch sphere around which Alice is rotating
her qubit, $\theta$ is the angle of rotation (it runs from zero to $2\pi$), and $\vec{\sigma}$ is the vector of Pauli
matrices. The transformations $U$ are in one-to-one correspondence with the points of a
three-dimensional spherical surface, which we can imagine embedded in four dimensions.
Specifically, we can label the point corresponding to the above $U$ by the unit vector

$$v_U = \left( \cos(\theta/2),\ n_x \sin(\theta/2),\ n_y \sin(\theta/2),\ n_z \sin(\theta/2) \right). \tag{44}$$

The natural measure on SU(2) is the uniform measure over this sphere; it is the unique
measure that is invariant under left-multiplication (or right-multiplication) by any group
element.

To determine whether the information is transferred optimally from Alice to Bob, we
need to compute the probabilities of Bob's outcomes. It is straightforward to do so, and one
finds that the probabilities are, in an arrangement parallel to that given in Eq. (42),

$$\begin{array}{ll}
\cos^2(\theta/2) & n_z^2 \sin^2(\theta/2) \\
n_x^2 \sin^2(\theta/2) & n_y^2 \sin^2(\theta/2)
\end{array} \tag{45}$$

These probabilities are the squared components of the unit vector $v_U$ given in Eq. (44).
Thus the problem is equivalent to the case of real-vector-space quantum mechanics in four
dimensions. So indeed, the information is transmitted optimally from Alice to Bob!
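The correspondence between Bob's probabilities and the squared components of the point on the sphere can be verified with a few lines of linear algebra. This is a minimal sketch under stated assumptions: the two-qubit basis is ordered $|00\rangle, |01\rangle, |10\rangle, |11\rangle$ with Alice's qubit first, and the exponential is expanded as $\cos(\theta/2)\, I + i \sin(\theta/2)\, \hat{n} \cdot \vec{\sigma}$. The function names and the sample rotation are illustrative.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def su2(theta, axis):
    """exp[i (theta/2) n.sigma], using (n.sigma)^2 = I to expand the exponential."""
    n = np.asarray(axis, dtype=float)
    n = n / np.linalg.norm(n)
    return np.cos(theta / 2) * I2 + 1j * np.sin(theta / 2) * (n[0] * X + n[1] * Y + n[2] * Z)

R = 1 / np.sqrt(2)
BELL = {  # components in the basis |00>, |01>, |10>, |11>
    "Phi+": R * np.array([1, 0, 0, 1], dtype=complex),
    "Phi-": R * np.array([1, 0, 0, -1], dtype=complex),
    "Psi+": R * np.array([0, 1, 1, 0], dtype=complex),
    "Psi-": R * np.array([0, 1, -1, 0], dtype=complex),
}

def bell_probabilities(theta, axis):
    """Alice applies U to the first qubit of |Phi+>; Bob measures in the Bell basis."""
    state = np.kron(su2(theta, axis), I2) @ BELL["Phi+"]
    return {name: float(abs(np.vdot(b, state)) ** 2) for name, b in BELL.items()}

def sphere_point(theta, axis):
    """Unit vector (cos(theta/2), n_x sin(theta/2), n_y sin(theta/2), n_z sin(theta/2))."""
    n = np.asarray(axis, dtype=float)
    n = n / np.linalg.norm(n)
    return np.array([np.cos(theta / 2), *(n * np.sin(theta / 2))])

theta, axis = 1.3, [0.2, -0.5, 0.8]  # arbitrary sample rotation
probs = bell_probabilities(theta, axis)
v = sphere_point(theta, axis)
print(probs)
print(v ** 2)  # same four numbers, in the order Phi+, Psi+, Psi-, Phi-
```

With these conventions the outcomes $\Phi^+$, $\Psi^+$, $\Psi^-$, $\Phi^-$ pick out the first, second, third, and fourth squared components, respectively.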

Does this example generalize to higher dimensions? The answer is no, at least not in
any way that I can see. For example, in three dimensions, we would probably want Alice
and Bob to start with the state $|\Phi\rangle = (|00\rangle + |11\rangle + |22\rangle)/\sqrt{3}$. Alice will perform a general
unit-determinant unitary transformation $U$, and then Bob will measure both particles in the
generalized Bell basis, which consists of the nine states

$$\begin{array}{lll}
|00\rangle + |11\rangle + |22\rangle & |00\rangle + \omega|11\rangle + \omega^2|22\rangle & |00\rangle + \omega^2|11\rangle + \omega|22\rangle \\
|01\rangle + |12\rangle + |20\rangle & |01\rangle + \omega|12\rangle + \omega^2|20\rangle & |01\rangle + \omega^2|12\rangle + \omega|20\rangle \\
|02\rangle + |10\rangle + |21\rangle & |02\rangle + \omega|10\rangle + \omega^2|21\rangle & |02\rangle + \omega^2|10\rangle + \omega|21\rangle.
\end{array} \tag{46}$$

Here $\omega = e^{2\pi i/3}$ and I have suppressed the normalization factor $1/\sqrt{3}$. A counting of
parameters is initially encouraging: it takes eight real numbers to specify an element of SU(3), and
Bob's measurement yields eight independent probabilities. However, one quickly discovers
that, as in the case of the SIC, the measurement does not make use of the whole probability
space.

Consider specifically the probabilities of the second and third outcomes listed on the first
row of Eq. (46); let us call these probabilities $p_2$ and $p_3$ (we imagine a list of nine probabilities
$p_1, \ldots, p_9$, of which these are the second and third). In terms of the components $u_{ij}$ of Alice's
unitary matrix $U$, we have

$$p_2 = \frac{1}{9} \left| u_{00} + \omega^2 u_{11} + \omega u_{22} \right|^2 \quad \text{and} \quad p_3 = \frac{1}{9} \left| u_{00} + \omega u_{11} + \omega^2 u_{22} \right|^2, \tag{47}$$

so that the product has the value

$$p_2 p_3 = \frac{1}{81} \left| u_{00}^2 + u_{11}^2 + u_{22}^2 - u_{00} u_{11} - u_{00} u_{22} - u_{11} u_{22} \right|^2. \tag{48}$$

Now, in the whole probability space the maximum value of $p_2 p_3$ is 1/4, attained when
$p_2 = p_3 = 1/2$. But given that each $u_{ij}$ can have a magnitude no larger than 1, one can
show that the expression in Eq. (48) cannot exceed the value $16/81 < 1/4$ [51]. Thus a
certain region of probability space is inaccessible in the scenario we are imagining. It follows
that the information about $U$ is not conveyed optimally to Bob through his measurement
outcomes.
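The inaccessible region can be probed numerically. The sketch below is an illustration under stated assumptions: Alice acts on the first particle, the nine basis states are built as $\frac{1}{\sqrt{3}} \sum_j \omega^{nj} |j, j+m\rangle$ (indices mod 3) in the row-by-row order of Eq. (46), and random unitaries are drawn by QR decomposition of a complex Gaussian matrix. It checks the product of the second and third probabilities against the bound 16/81 and against the closed form in terms of the diagonal entries of $U$. All function names are hypothetical.

```python
import numpy as np

OMEGA = np.exp(2j * np.pi / 3)

def random_su3(rng):
    """Random 3x3 unitary with unit determinant (QR of a complex Gaussian matrix)."""
    z = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
    q, r = np.linalg.qr(z)
    q = q * (np.diag(r) / np.abs(np.diag(r)))  # fix the column phases
    return q / np.linalg.det(q) ** (1 / 3)     # rescale so that det = 1

def bell_basis():
    """Nine generalized Bell states; row m of the 3x3 arrangement, column n."""
    states = []
    for m in range(3):
        for n in range(3):
            v = np.zeros(9, dtype=complex)
            for j in range(3):
                v[3 * j + (j + m) % 3] = OMEGA ** (n * j) / np.sqrt(3)
            states.append(v)
    return np.array(states)

BASIS = bell_basis()

def outcome_probabilities(u):
    """Probabilities of the nine outcomes after Alice applies u to the first particle."""
    phi = np.zeros(9, dtype=complex)
    for j in range(3):
        phi[3 * j + j] = 1 / np.sqrt(3)
    psi = np.kron(u, np.eye(3)) @ phi
    return np.abs(BASIS.conj() @ psi) ** 2

def product_from_entries(u):
    """Closed form for p2 * p3 in terms of the diagonal entries of u."""
    a, b, c = u[0, 0], u[1, 1], u[2, 2]
    return abs(a * a + b * b + c * c - a * b - a * c - b * c) ** 2 / 81

rng = np.random.default_rng(0)
largest = 0.0
for _ in range(2000):
    p = outcome_probabilities(random_su3(rng))
    largest = max(largest, p[1] * p[2])
print("largest p2*p3 found:", largest, " bound:", 16 / 81, " unrestricted maximum:", 0.25)
```

The diagonal matrix with entries $(1, -1, -1)$ has unit determinant and gives $p_2 = p_3 = 4/9$, so the value 16/81 is actually attained.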

Thus, as far as I can tell, the property of optimal information transfer does not easily 
carry over from the real-vector-space theory to ordinary quantum mechanics. 



VI. CONCLUSIONS 



In real-vector-space quantum theory, the number of parameters needed to specify a pure
state is equal to the number of independent probabilities in a complete orthogonal measurement:
both are equal to $d - 1$ for a state in $d$ dimensions. So by measuring many copies
prepared in an unknown pure state, one can hope to pin the state down to a finite number
of small regions in state space. In this paper we have seen that this pinning down is in fact
optimal, in the sense that the observer gains as much information about the state as could
possibly be gained in any probabilistic theory, at least when the number of trials is very
large.

Standard quantum theory, based on a complex vector space, does not have this property, 
and we have not been able to find a restatement of the problem for which the complex 
theory does achieve such an optimization (except for the case of a unitary transformation 
applied to a qubit). For our original statement of the problem, one can say that this lack of 
optimization comes from the fact that for any specific orthogonal measurement, a complete 
specification of a pure state includes phase factors that have no effect on the probabilities 
of the outcomes. The presence of these phase factors changes the natural a priori measure 
on probability space, and the mutual information is no longer maximized. 

Note that in the complex theory the number of real parameters needed to specify a pure
state in $d$ dimensions is $2d - 2$ if we do not count an irrelevant overall phase factor. This
number is exactly twice the number of independent probabilities an orthogonal measurement 
can access, and it seems that this doubling of the number of parameters is what spoils the 
optimization. It is reasonable to ask whether there can be some deeper understanding of 
this factor of two, but at this point it is hard to have confidence in any particular answer. 

In a sense, any axiomatization of quantum mechanics offers a potential answer to this
question: whatever assumptions give rise to the structure of quantum theory also give rise to
the factor of two. In his axiomatization, Goyal addresses the factor of two directly, formalizing
it in his principle of complementarity: for a measurement that at some level generates
$2d$ possible events, only $d$ distinguishable outcomes can be observed, each corresponding
to a pair of fundamental events. This principle, together with a principle of global gauge
invariance, leads him to the basic structure of quantum mechanics [51]. One achieves a
similar result by assuming that the underlying theory is real-vector-space quantum theory,
but that because our knowledge is limited in some fundamental way we do not see the full
real-vector-space structure; we have access only to those observables that satisfy Stueckelberg's
rule, that is, those observables that commute with the operator $I \otimes J$ defined in the
introduction. (Goyal in fact relates his work to Stueckelberg's. See also Ref. in which
Stueckelberg's rule emerges dynamically.) Imposing Stueckelberg's rule on a real vector
space of $2d$ dimensions reduces the maximum number of orthogonal states from $2d$ to $d$, and
it cuts in half the number of parameters required to specify a maximally pure state [52].

While such an interpretation would give an important role to the real-vector-space theory, 
it raises a difficult question about the status of the main result in this paper. If the limitation 
on our knowledge is fundamental, then who are the observers for whom the transfer of 
information from preparation to measurement is optimal? Evidently it is not optimal for 
us, because whatever the underlying theory may be, the effective theory within which we 
live seems to be complex-vector-space quantum theory. 
[47] At least in real-vector-space quantum theory, one never needs to make global measurements
involving more than two subsystems [15].

[48] The experimenter is trying to determine the value of an unknown probability. It may seem
that this problem cannot be framed except in the context of an objective interpretation of
the concept of probability, but this is not the case. The representation theorem of de Finetti
shows how to express this kind of question within a subjective interpretation [24]. We note
that the quantum de Finetti theorem does not hold in real-vector-space quantum theory [25],
but this fact does not preclude a subjective interpretation of probability in our problem. In
our problem the experimenter is trying to refine a distribution over ordinary probability space,
to which the classical de Finetti theorem applies.

[49] The differential entropy is not the limit of the entropy of a discretized version of the continuous
variable. However, a mutual information involving a continuous variable, being the difference
between two differential entropies, is indeed the limit of the discretized mutual information
[26].

[50] Alternatively, instead of using differential entropies as in Eq. ([s]), we could have expressed 
the mutual information $I(\alpha : n)$ in terms of the Kullback-Leibler distances of both $K(\alpha)$ and
$P(\alpha|n)$ from the uniform distribution over $\alpha$. The calculation in Section III then tells us that $I$
is maximized when the Kullback-Leibler distance between $K(\alpha)$ and the uniform distribution
is minimized, that is, when $K(\alpha)$ is itself the uniform distribution.

[51] In proving this inequality, we are free to set $u_{00}$ equal to 1. Then let $u_{11} = -a$ and $u_{22} = -b$
and the desired inequality becomes

$$|1 + a + b + a^2 + b^2 - ab| \le 4$$

under the assumption that $|a| \le 1$ and $|b| \le 1$. (One can see that equality is achieved when
$a = b = 1$.) This inequality is equivalent to

$$|A^2 + B^2 + (A - B)^2| \le 8,$$

where $A = 1 + a$ and $B = 1 + b$. This last inequality can be proved by first noting that

$$|A^2 + B^2 + (A - B)^2| \le |A^2 + B^2| + |A - B|^2$$

and then writing out the absolute values explicitly. One has to use the fact that $A$ and $B$ are
both confined to a circle of unit radius in the complex plane, centered at the value 1.
[52] When standard quantum mechanics is expressed in real-vector-space terms, what we normally 
call a pure state is represented by a density matrix of rank 2. 